Write a short description about the course and add a link to your GitHub repository here. This is an R Markdown (.Rmd) file so you should use R Markdown syntax.
# This is a so-called "R chunk" where you can write R code.
today <- date()
paste("Today is", today, ". Hello R! Hello world!")
## [1] "Today is Thu Nov 19 16:11:49 2020 . Hello R! Hello world!"
How I found out about this course: I already came across this course last year (unfortunately I didn't have the time to participate) and was intrigued by the contents and impressed by the fabulous feedback.
My background and expectations: As a graduate student in physics I have grown quite fond of using Python for my data analysis work. Now I am excited to see and compare what is possible with R, and I am looking forward to participating in this interesting course! So far, the "warming up" week went fine and the introductions were easy to follow. The syntax shown so far didn't look too complicated and was partly familiar from Python and Python libraries such as pandas and matplotlib. As I am quite used to Python (and it is used more and more in the natural sciences), I don't expect to switch to R after the course, but I would love to learn more about it and get an understanding of its pros and cons. So if something comes up in the future for which R is better suited, I will be able to use it, or, if collaborators use it, able to understand it. And furthermore, it's always fun to learn new skills. :)
Link to my GitHub repository: https://github.com/kirstef/IODS-project
Looking forward to next week!
Describe the work you have done this week and summarize your learning.
In this chapter we will explore the dataset from http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS3-data.txt, first taking a closer look at the data frame and afterwards visualizing some interesting features.
REMARK: The data wrangling part is done in the .R file in the data folder. Still, I will show a preview of the original dataset in the diary, as it illustrates the development from the original data to the dataset we finally analyse.
First let’s read in the full learning2014 data as a dataframe from the given website.
lrn14 <- read.table("http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS3-data.txt", sep="\t", header=TRUE)
head(lrn14)
## Aa Ab Ac Ad Ae Af ST01 SU02 D03 ST04 SU05 D06 D07 SU08 ST09 SU10 D11 ST12
## 1 3 1 2 1 1 1 4 2 4 4 2 4 4 3 3 2 3 3
## 2 2 2 2 2 1 1 4 2 4 4 4 2 3 4 4 1 4 1
## 3 4 1 1 1 1 1 3 1 4 4 2 3 4 1 3 1 4 4
## 4 4 2 3 2 1 1 3 3 4 4 3 4 4 2 3 1 3 3
## 5 3 2 2 1 2 1 4 2 5 3 4 4 4 3 4 2 4 2
## 6 4 2 1 1 1 1 4 3 5 4 3 5 5 4 4 1 5 3
## SU13 D14 D15 SU16 ST17 SU18 D19 ST20 SU21 D22 D23 SU24 ST25 SU26 D27 ST28
## 1 3 4 3 2 3 2 4 2 3 3 2 2 4 4 4 4
## 2 3 2 3 4 4 2 3 1 2 2 3 4 2 4 2 2
## 3 2 4 2 3 3 1 4 3 2 4 3 3 4 4 3 5
## 4 2 4 3 2 3 1 3 3 3 3 3 2 3 2 3 3
## 5 3 4 3 3 4 1 4 3 2 3 3 4 4 3 3 5
## 6 1 5 4 2 3 2 4 3 4 5 4 2 4 2 5 4
## SU29 D30 D31 SU32 Ca Cb Cc Cd Ce Cf Cg Ch Da Db Dc Dd De Df Dg Dh Di Dj Age
## 1 3 4 4 3 2 4 3 4 3 2 3 4 3 4 4 5 4 2 4 3 4 4 53
## 2 3 3 4 5 4 4 4 5 5 3 2 4 4 3 3 4 3 2 3 3 2 4 55
## 3 2 4 3 5 3 5 4 4 3 4 4 2 1 4 4 1 4 1 3 1 1 5 49
## 4 3 4 4 3 3 4 4 4 3 4 4 3 2 4 5 2 5 1 5 4 2 5 53
## 5 3 3 4 4 2 4 4 3 3 3 4 4 3 4 4 4 4 2 5 5 3 3 49
## 6 2 5 5 3 3 5 4 4 3 4 5 4 3 5 4 4 4 3 4 3 3 5 38
## Attitude Points gender
## 1 37 25 F
## 2 31 12 M
## 3 25 24 F
## 4 35 10 M
## 5 37 22 M
## 6 38 21 F
Only looking at the head of the data frame makes it quite clear that some data wrangling will help in understanding the data better. It also shows how valuable metadata is, since the variable names alone don't tell us what they mean.
For the wrangling we include the dplyr library (library(dplyr)), a common R package for data manipulation.
library(dplyr)
First let’s use read.csv() or read.table() to read in the data subset created before and saved as a .csv in the project data-subfolder.
students2014 <- read.csv("./data/learning2014.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)
With head(students2014) we can display the first six rows of the subset. It is now much more readable and interpretable than before.
## gender age attitude deep stra surf points
## 1 F 53 3.7 3.583333 3.375 2.583333 25
## 2 M 55 3.1 2.916667 2.750 3.166667 12
## 3 F 49 2.5 3.500000 3.625 2.250000 24
## 4 M 53 3.5 3.500000 3.125 2.250000 10
## 5 M 49 3.7 3.666667 3.625 2.833333 22
## 6 F 38 3.8 4.750000 3.625 2.416667 21
Structure and dimension of the subset
With dim() and str() we can further explore the dimension and structure of the dataset.
dim(students2014)
## [1] 183 7
str(students2014)
## 'data.frame': 183 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
dim() already tells us that the subset consists of 183 observations of 7 variables. str() gives further information: the variables are those that we defined before (cf. the R script) and comprise 4 combined numerical variables giving the students' scores in the respective categories on a scale of 0-5, 2 integer variables giving the age and the exam points, and one factor variable stating the gender.
Getting rid of 0 values
We can filter our subset further, e.g. to exclude the students who didn't participate in the final exam, and afterwards look at the dimension again.
# select rows where points are greater than zero
students2014 <- filter(students2014, points > 0)
dim(students2014)
## [1] 166 7
As one can see, we have \(183-166=17\) rows with 0 points, i.e. students who did not take the final exam. Let's keep it like this for the further exploration of the dataset, as the reasons for not taking the exam are unknown to us and thus don't necessarily relate strongly to the attitude or the other variables.
Using ggpairs gives us a good graphical overview of the data and the correlations of the different variables with each other. Using col = gender in the mapping argument lets us also see gender differences (color-coded: female red, male blue).
library(ggplot2)
library(GGally)
# create a plot matrix with ggpairs()
ggpairs(students2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
Studying the plot in more detail, we see the following parts: scatter plots between the different variables, distributions of the individual variables, and correlation values between variable pairs. The higher the absolute correlation value, the stronger the correlation between the respective variables. As we can see, the highest correlation appears to be between points and attitude. Looking at the scatter plot between these two variables, we can also see that the data points are roughly scattered along a line, which hints at a relation between them. The attitude distribution shows a visible gender difference: whereas the distribution for the female students rises up to about 2.5 and then stays on an almost flat "plateau" until 3.5, the distribution for the male students has only a small fraction of students at 2 or less, followed by a steep rise to a peak at a bit over 3. The surface question distribution ("surf") peaks at about 3 for the female students, higher than for the male students. For the age one can see (as expected) that most of the students are between 20 and 30 years old.
summary(students2014)
## gender age attitude deep stra
## F:110 Min. :17.00 Min. :1.400 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :3.200 Median :3.667 Median :3.188
## Mean :25.51 Mean :3.143 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :5.000 Max. :4.917 Max. :5.000
## surf points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
The summary() command gives us an overview of the statistics of the different variables. We have about twice as many female students as male students, the mean age is between 25 and 26 years, and the mean score for the question types (deep, strategic, surface) lies between 2.8 and 3.7, with deep questions scoring highest. The mean attitude lies at about 3.1 and the mean points at about 23. Apart from these means, we also get the minimum and maximum values, the 1st and 3rd quartiles and the median.
From the pairs plot above let’s choose the points as our target value and the 3 variables with the highest correlation to points as explanatory variables to be used for the regression model: exam points ~ x1 + x2 + x3. The respective variables would be attitude (corr: 0.437), stra (corr: 0.146) and surf (corr: -0.140).
summary(my_model) gives us further information on the validity and significance of our model.
# create a regression model with multiple explanatory variables
my_model <- lm(points ~ attitude + stra + surf, data = students2014)
summary(my_model)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
To find out more about the significance of the correlations we can look at the t value and Pr(>|t|). The intuition is that the t value grows in magnitude when the null hypothesis is not true; the null hypothesis here is that there is no relation between the variable and the target value. The t value on its own is difficult to interpret, but Pr(>|t|) gives us the probability of obtaining a t value at least this extreme if the null hypothesis were true. As one can see, this probability is very low for attitude, which is also indicated by the *** next to the value (described as significance codes in the legend). Thus, a relation between points and attitude is quite evident. Neither stra nor surf has a significant relation to the target variable, so I will remove them from the model.
Just for fun, once more the model with attitude and stra:
my_model2 <- lm(points ~ attitude + stra, data = students2014)
summary(my_model2)
##
## Call:
## lm(formula = points ~ attitude + stra, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.9729 2.3959 3.745 0.00025 ***
## attitude 3.4658 0.5652 6.132 6.31e-09 ***
## stra 0.9137 0.5345 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
The t value is a bit higher and Pr(>|t|) a bit smaller than before, but the relation of stra to the target is still not significant.
After also removing the stra variable, the model reduces to the following:
my_model3 <- lm(points ~ attitude, data = students2014)
summary(my_model3)
##
## Call:
## lm(formula = points ~ attitude, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
The values now show a clear significance of the relation between points and attitude. So what does that mean? Apparently, the attitude towards the course influences the final result in the exam (points). That relation is not really surprising: if you like a course and are interested in it, you probably study more for it and will more easily get a higher score. The model could now be used to predict the points of a student with attitude = x.
Prediction of points for students with different attitudes
new_students <- c("Student X" = 2.5, "Student Y" = 4.8)
new_data <- data.frame(attitude = new_students)
# Predict the new students exam points based on attitude
predict(my_model3, newdata = new_data)
## Student X Student Y
## 20.45080 28.55936
Plotting the two variables against each other, we can see that the predicted points lie right on the regression line (as expected, since they come from the same model). However, there is still a lot of scatter of the points above and below the regression line, even though the tendency of the relationship is clearly visible. The multiple R-squared value is a measure of how well the fitted model explains the data: a value of about 0.2 means the model explains about 20% of the variance around the mean. So the model is far from perfect, but may still be useful. To judge this further, one has to take a look at the residuals.
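To make the meaning of R-squared concrete, here is a small sketch (with made-up toy data, not the course data) verifying that multiple R-squared equals one minus the ratio of residual to total sum of squares:

```r
# Toy data (hypothetical, only to illustrate the R-squared definition)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
fit <- lm(y ~ x)
# R-squared = 1 - SS_residual / SS_total
rsq_manual <- 1 - sum(residuals(fit)^2) / sum((y - mean(y))^2)
all.equal(rsq_manual, summary(fit)$r.squared)  # TRUE
```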
library(ggplot2)
# plotting the two variables against each other with aesthetic mapping
p1 <- ggplot(students2014, aes(x = attitude, y = points, col=gender))
# using points for visualization
p2 <- p1 + geom_point()
# add a regression line and plot
p3 <- p2 + geom_smooth(method = "lm")
p3
## `geom_smooth()` using formula 'y ~ x'
The Residuals vs Fitted plot shows that the residuals are fairly evenly scattered above and below the zero line, which is a good sign for our model.
The Q-Q plot shows that the residuals follow the line for most of the inner quantiles and only diverge from it in the outer quantiles.
The Residuals vs Leverage plot can show us whether some points have too much leverage (outliers). In this case there is no visible outlier that completely changes the trend of the fit.
#par(mfrow = c(2,2)) #To put the images in the same figures. However, I prefer them a bit bigger right now.
plot(my_model3, which=c(1,2,5))
The data wrangling part is performed in the R-script create_alc.R in the data folder and the created dataset is saved as a .csv file (‘alc.csv’) in the data folder.
The dataset we want to look at is joined from two datasets obtained from the following website: https://archive.ics.uci.edu/ml/datasets/Student+Performance , where it is also further described. It covers the performance of secondary education students of two Portuguese schools in mathematics and Portuguese. The datasets contain the students' answers to different questions concerning their school and personal life, social background, circumstances, etc.
Here, in our analysis, we will in particular look at the alcohol consumption and its relation to other variables.
To start the data analysis part, we include the library dplyr (library(dplyr)) and read in the dataset.
library(dplyr)
alc <- read.csv("./data/alc.csv")
To get an overview of the dataset, let's print out the names of the variables and get the dimension values. The meaning of the variables can be found on the above-mentioned website.
colnames(alc)
## [1] "X" "school" "sex" "age" "address"
## [6] "famsize" "Pstatus" "Medu" "Fedu" "Mjob"
## [11] "Fjob" "reason" "nursery" "internet" "guardian"
## [16] "traveltime" "studytime" "failures" "schoolsup" "famsup"
## [21] "paid" "activities" "higher" "romantic" "famrel"
## [26] "freetime" "goout" "Dalc" "Walc" "health"
## [31] "absences" "G1" "G2" "G3" "alc_use"
## [36] "high_use"
dim(alc)
## [1] 382 36
In total, we have 382 observations and 36 variables in the joined alc dataset.
Choose 4 interesting variables from the data and formulate hypotheses about their relationship with alcohol consumption
Hypothesis 1 - sex: There is a statistical difference between the sexes: male students are more likely to have a high alcohol consumption than female students.
Hypothesis 2 - failures: The number of past class failures is related to alcohol consumption (more failures leading to more alcohol consumption).
Hypothesis 3 - famrel: Good family relationships make high alcohol consumption less likely.
Hypothesis 4 - goout: The frequency of going out with friends affects the alcohol consumption.
With gather() and ggplot() one can first get an overview bar plot of each variable.
library(tidyr); library(dplyr); library(ggplot2)
# draw a bar plot of each variable
gather(alc) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
Now we can continue by looking at the values interesting for the hypotheses stated above.
Hypothesis 1: gender and alcohol consumption
alc %>% group_by(sex, high_use) %>% summarise(count=n())
## `summarise()` regrouping output by 'sex' (override with `.groups` argument)
## # A tibble: 4 x 3
## # Groups: sex [2]
## sex high_use count
## <chr> <lgl> <int>
## 1 F FALSE 156
## 2 F TRUE 42
## 3 M FALSE 112
## 4 M TRUE 72
The cross-tabulation supports Hypothesis 1: whereas 21.2% of the female students report a high alcohol consumption, 39.1% of the male students do, almost twice the share. To look at this graphically, let's draw a bar plot.
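These per-sex shares can be computed directly with prop.table(); a minimal sketch with toy data (not the real alc dataset) to show the pattern:

```r
# Hypothetical toy data illustrating the per-sex share of high_use
toy <- data.frame(sex = c("F", "F", "F", "M", "M"),
                  high_use = c(FALSE, TRUE, FALSE, TRUE, TRUE))
# margin = 1 normalises each row, i.e. within each sex
prop.table(table(toy$sex, toy$high_use), margin = 1)
```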
library(ggplot2)
g2 <- ggplot(alc, aes(x = high_use, col=sex))
g2 + geom_bar()
The bar plot shows quite clearly that the ratio of high use to non-high use differs between male and female students.
Hypothesis 2 - failures: Let’s now investigate if one can see a clear relation between the failures and the alcohol consumption.
alc %>% group_by(high_use) %>% summarise(count=n())
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
## high_use count
## <lgl> <int>
## 1 FALSE 268
## 2 TRUE 114
alc %>% group_by(failures) %>% summarise(count=n())
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 4 x 2
## failures count
## <int> <int>
## 1 0 334
## 2 1 24
## 3 2 19
## 4 3 5
alc %>% group_by(high_use, failures) %>% summarise(count=n())
## `summarise()` regrouping output by 'high_use' (override with `.groups` argument)
## # A tibble: 8 x 3
## # Groups: high_use [2]
## high_use failures count
## <lgl> <int> <int>
## 1 FALSE 0 244
## 2 FALSE 1 12
## 3 FALSE 2 10
## 4 FALSE 3 2
## 5 TRUE 0 90
## 6 TRUE 1 12
## 7 TRUE 2 9
## 8 TRUE 3 3
alc %>% group_by(high_use, failures>0) %>% summarise(count=n())
## `summarise()` regrouping output by 'high_use' (override with `.groups` argument)
## # A tibble: 4 x 3
## # Groups: high_use [2]
## high_use `failures > 0` count
## <lgl> <lgl> <int>
## 1 FALSE FALSE 244
## 2 FALSE TRUE 24
## 3 TRUE FALSE 90
## 4 TRUE TRUE 24
As one can see, 5 of the 382 students have experienced 3 failures, and of these 5, three report high alcohol consumption. This sample is however too small to judge whether a real relationship exists. Let's create a bar and a box plot to investigate further.
g3 <- ggplot(alc, aes(x=failures, col=high_use))
g3 + geom_bar()
g4 <- ggplot(alc, aes(x = sex, y = alc_use, col=failures>0))
g4 + geom_boxplot()
Looking at the bar plot, one might at least conclude that having experienced no failures makes high alcohol consumption less likely. However, the sample of students with failures is much smaller than the sample without, so it makes sense to combine all students with failures > 0. The box plot then gives a nice visualisation of how having experienced at least 1 failure relates to alcohol consumption. Again, the effect seems to be much stronger for the male students.
Hypothesis 3 - Family:
alc %>% group_by(famrel,high_use) %>% summarise(count=n())
## `summarise()` regrouping output by 'famrel' (override with `.groups` argument)
## # A tibble: 10 x 3
## # Groups: famrel [5]
## famrel high_use count
## <int> <lgl> <int>
## 1 1 FALSE 6
## 2 1 TRUE 2
## 3 2 FALSE 10
## 4 2 TRUE 9
## 5 3 FALSE 39
## 6 3 TRUE 25
## 7 4 FALSE 135
## 8 4 TRUE 54
## 9 5 FALSE 78
## 10 5 TRUE 24
g5 <- ggplot(alc, aes(x = high_use, y = famrel))
g5 + geom_boxplot() + ylab("quality of family relationships (from 1 - very bad to 5 - excellent)")
g6 <- ggplot(alc, aes(x=famrel, col=high_use))
g6 + geom_bar()
Both the box plot and the bar plot seem to at least show a trend in favour of the hypothesis: whereas the median quality of family relationships is at 3.5 for students reporting a high alcohol consumption, it is higher (4.5) for students without high consumption. The bar plot gives a clearer view of the individual family relationship bins: the high-use fraction is larger for famrel = 2 or 3 than for 4 and 5. For famrel = 1 the sample is probably too small to judge.
Hypothesis 4 - going out
alc %>% group_by(goout,high_use) %>% summarise(count=n())
## `summarise()` regrouping output by 'goout' (override with `.groups` argument)
## # A tibble: 10 x 3
## # Groups: goout [5]
## goout high_use count
## <int> <lgl> <int>
## 1 1 FALSE 19
## 2 1 TRUE 3
## 3 2 FALSE 84
## 4 2 TRUE 16
## 5 3 FALSE 103
## 6 3 TRUE 23
## 7 4 FALSE 41
## 8 4 TRUE 40
## 9 5 FALSE 21
## 10 5 TRUE 32
g7 <- ggplot(alc, aes(x = high_use, y = goout, col=sex))
g7 + geom_boxplot() + ylab("going out with friends (from 1 [very low] to 5 [very high])")
g8 <- ggplot(alc, aes(x=goout, col=high_use))
g8 + geom_bar()
Again, the numeric cross-tabulation as well as the plots support the hypothesis: going out with friends more often (a higher value) seems to be related to higher alcohol consumption.
# find the model with glm()
m <- glm(high_use ~ sex + failures + famrel + goout, data = alc, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ sex + failures + famrel + goout, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6492 -0.7464 -0.5207 0.8642 2.4194
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.3922 0.6415 -3.729 0.000192 ***
## sexM 0.8976 0.2530 3.548 0.000388 ***
## failures 0.3339 0.2034 1.641 0.100696
## famrel -0.3971 0.1390 -2.857 0.004281 **
## goout 0.7751 0.1217 6.367 1.93e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 391.08 on 377 degrees of freedom
## AIC: 401.08
##
## Number of Fisher Scoring iterations: 4
The model summary shows a high significance for the correlation with sexM and goout, a medium significance for famrel, and no significant value for failures. One could fit the model again, dropping at least the failures variable, but first let's study the odds ratios (OR).
coef(m)
## (Intercept) sexM failures famrel goout
## -2.3922069 0.8976484 0.3338923 -0.3971296 0.7751449
# compute odds ratios (OR)
OR <- coef(m) %>% exp
# compute confidence intervals (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.09142769 0.02506625 0.3124735
## sexM 2.45382585 1.50205707 4.0576659
## failures 1.39639281 0.93864152 2.0963627
## famrel 0.67224693 0.51012773 0.8814055
## goout 2.17090677 1.72124671 2.7771652
The odds ratios tell us whether having a certain property (e.g. being male, going out often) is positively or negatively associated with the logical target variable (here high alcohol consumption): an OR > 1 means a positive association, an OR < 1 a negative one.
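As a quick cross-check, an odds ratio is just the exponential of the log-odds coefficient reported by glm(); using the coefficients printed above:

```r
# exponentiating the glm() log-odds coefficients gives the odds ratios
exp(0.8976484)   # sexM: about 2.45, matching the OR table above
exp(-0.3971296)  # famrel: about 0.67, a negative association
```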
The male sex variable has the strongest relationship to the binary target variable, with an odds ratio of 2.45; that means the odds of a male student having high alcohol consumption are about 2.5 times those of a female student. "Going out" is also positively associated with high alcohol consumption, whereas "family relation" has a negative association (a high quality of family relations making low alcohol consumption more likely). These odds ratios thus support the above-mentioned hypotheses.
As failures did not have a significant relationship according to my logistic regression model, I will drop this variable, fit the model again, print out the summary and calculate the OR and confidence intervals:
m2 <- glm(high_use ~ sex + famrel + goout, data = alc, family = "binomial")
summary(m2)
##
## Call:
## glm(formula = high_use ~ sex + famrel + goout, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5836 -0.7707 -0.5316 0.8198 2.5474
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.3550 0.6414 -3.672 0.000241 ***
## sexM 0.9342 0.2514 3.716 0.000202 ***
## famrel -0.4118 0.1387 -2.969 0.002989 **
## goout 0.7971 0.1214 6.565 5.2e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 393.80 on 378 degrees of freedom
## AIC: 401.8
##
## Number of Fisher Scoring iterations: 4
coef(m2)
## (Intercept) sexM famrel goout
## -2.3549669 0.9341541 -0.4117517 0.7971166
OR <- coef(m2) %>% exp
CI <- confint(m2) %>% exp
## Waiting for profiling to be done...
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.09489664 0.02602013 0.3241397
## sexM 2.54505957 1.56349803 4.1969640
## famrel 0.66248876 0.50299888 0.8679153
## goout 2.21913298 1.76086023 2.8372779
As a next step let's check the predictive power of the new model, adding the new columns probability and prediction to the alc dataset. probability is calculated by applying the predict() function to the model, and prediction is defined as TRUE if the probability is higher than 50%.
probabilities <- predict(m2, type = "response")
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probabilities > 0.5)
To get an idea of how well the model performs, we can print the last ten rows of the columns we are interested in and look at the confusion matrix:
select(alc, sex, famrel, goout, high_use, probability, prediction) %>% tail(10)
## sex famrel goout high_use probability prediction
## 373 M 4 2 FALSE 0.18639810 FALSE
## 374 M 5 3 TRUE 0.25195331 FALSE
## 375 F 5 3 FALSE 0.11687357 FALSE
## 376 F 4 3 FALSE 0.16650200 FALSE
## 377 F 5 2 FALSE 0.05627990 FALSE
## 378 F 4 4 FALSE 0.30714360 FALSE
## 379 F 2 2 FALSE 0.17019623 FALSE
## 380 F 1 1 FALSE 0.12243164 FALSE
## 381 M 2 5 TRUE 0.85084788 TRUE
## 382 M 4 1 TRUE 0.09357856 FALSE
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 250 18
## TRUE 63 51
To graphically visualize the actual values and the predictions, we can draw a plot of ‘high use’ vs. ‘probability’.
g <- ggplot(alc, aes(x = probability, y = high_use, col=prediction))
g + geom_point()
Furthermore, we can do another cross-tab of predictions vs. actual values, printing out the fractions instead and adding margins.
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.65445026 0.04712042 0.70157068
## TRUE 0.16492147 0.13350785 0.29842932
## Sum 0.81937173 0.18062827 1.00000000
Accuracy of the model
The proportion of inaccurately classified individuals (the training error) can be calculated from the values in the confusion matrix: (18+63)/(250+18+63+51) ≈ 0.21. We get the same value by defining a loss function as below:
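As a cross-check, the same number can be computed directly from the confusion matrix counts quoted above:

```r
# Confusion matrix counts from the table above (rows: high_use, cols: prediction)
cm <- matrix(c(250, 63, 18, 51), nrow = 2,
             dimnames = list(high_use = c("FALSE", "TRUE"),
                             prediction = c("FALSE", "TRUE")))
# misclassified = false positives + false negatives
training_error <- (cm["FALSE", "TRUE"] + cm["TRUE", "FALSE"]) / sum(cm)
training_error  # 81/382, about 0.212
```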
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2120419
The loss function gives us a value of about 0.21 (as calculated before), meaning that about 21% of the predictions on the training data are wrong. This value is better than the value in the DataCamp example (about 26%), making this model slightly better.
To study the test set performance of the model one can use K-fold cross-validation. Here, we will try out 10-fold cross-validation, making use of the loss function defined above.
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m2, K=10)
cv$delta[1]
## [1] 0.2303665
The cv.glm function of the boot library can be used for this, and K = 10 defines the number of folds used for the cross-validation. The delta attribute stores the error value. It gives an error of about 0.23, which is better than the 0.26 error of the DataCamp model; thus this model performs a bit better on unseen data as well.
This chapter is focused on clustering and classification and the Boston dataset will be the foundation for this. The dataset is already included in the MASS package so it can be loaded directly.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
data("Boston")
To get an idea of what dataset we are dealing with, let's look at its dimensions (dim(Boston)) and structure (str(Boston)).
## [1] 506 14
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
The dataset is made up of 506 observations and 14 variables describing housing values in suburbs of Boston. Two of the variables (chas and rad) are of integer type, the others are numeric. Further information can be found at https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html.
“Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them.”
Let's see if a pairs() plot can show us some interesting things:
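The pairs plot itself is not visible in this text rendering; the chunk behind it is presumably just the following (assuming the Boston data loaded from MASS above):

```r
library(MASS)  # provides the Boston dataset
# scatter plot matrix of all 14 variables against each other
pairs(Boston)
```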
A summary (summary(Boston)) can give us more insight:
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
Neither the pairs plot nor the summary is easy to interpret at first glance. Another option is a correlation plot, which graphically displays the correlations between the different variables.
library(corrplot)
library(tidyverse)
cor_matrix <- cor(Boston) %>% round(digits=2)
corrplot(cor_matrix, method="circle", cl.pos="b")
In this plot we can nicely see which variables are the most strongly correlated by looking at the colours and diameters of the circles. For example, circles with a darker shade of blue have a correlation value close to 1; comparing with the pairs plot above, these pairs show a linear tendency (e.g. nox and indus, denoting “nitrogen oxides concentration” and “proportion of non-retail business acres per town”). Then there are variables whose relationship is visualised by a very small, nearly white circle; those are the ones with no significant relationship (e.g. rm and black, “average number of rooms per dwelling” vs. a measure of the black population in the town), which is also visible in the pairs plot.
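To back up the visual impression, one could also extract the strongest pairwise correlations numerically. This is a hypothetical helper, not part of the course material:

```r
library(MASS)
data("Boston")
cm <- cor(Boston)
cp <- as.data.frame(as.table(cm))                          # long format: Var1, Var2, Freq
cp <- subset(cp, as.character(Var1) < as.character(Var2))  # drop diagonal and duplicates
head(cp[order(-abs(cp$Freq)), ], 5)                        # strongest correlations first
```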
“Standardize the dataset and print out summaries of the scaled data. How did the variables change? Create a categorical variable of the crime rate in the Boston dataset (from the scaled crime rate). Use the quantiles as the break points in the categorical variable. Drop the old crime rate variable from the dataset. Divide the dataset to train and test sets, so that 80% of the data belongs to the train set.”
Scaling the data
boston_scaled <- scale(Boston)
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
boston_scaled <- as.data.frame(boston_scaled) # save data in a dataframe
head(boston_scaled)
## crim zn indus chas nox rm age
## 1 -0.4193669 0.2845483 -1.2866362 -0.2723291 -0.1440749 0.4132629 -0.1198948
## 2 -0.4169267 -0.4872402 -0.5927944 -0.2723291 -0.7395304 0.1940824 0.3668034
## 3 -0.4169290 -0.4872402 -0.5927944 -0.2723291 -0.7395304 1.2814456 -0.2655490
## 4 -0.4163384 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.0152978 -0.8090878
## 5 -0.4120741 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.2273620 -0.5106743
## 6 -0.4166314 -0.4872402 -1.3055857 -0.2723291 -0.8344581 0.2068916 -0.3508100
## dis rad tax ptratio black lstat medv
## 1 0.140075 -0.9818712 -0.6659492 -1.4575580 0.4406159 -1.0744990 0.1595278
## 2 0.556609 -0.8670245 -0.9863534 -0.3027945 0.4406159 -0.4919525 -0.1014239
## 3 0.556609 -0.8670245 -0.9863534 -0.3027945 0.3960351 -1.2075324 1.3229375
## 4 1.076671 -0.7521778 -1.1050216 0.1129203 0.4157514 -1.3601708 1.1815886
## 5 1.076671 -0.7521778 -1.1050216 0.1129203 0.4406159 -1.0254866 1.4860323
## 6 1.076671 -0.7521778 -1.1050216 0.1129203 0.4101651 -1.0422909 0.6705582
One can see that as a result of the scaling each variable is now distributed around a mean of 0 with unit standard deviation, with values thus running from negative to positive.
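A quick sketch to verify this: after `scale()`, every column should have mean essentially 0 and standard deviation exactly 1.

```r
library(MASS)
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))
round(colMeans(boston_scaled), 10)   # all essentially 0
apply(boston_scaled, 2, sd)          # all exactly 1
```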
Categorical variable for the crime rate
The “per capita crime rate by town” can be found in the dataset in the continuous variable “crim”. This continuous variable should now be changed into a categorical one, using quantiles as break points. Let’s look at the quantiles first with `summary(boston_scaled$crim)` and then define our bins:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419367 -0.410563 -0.390280 0.000000 0.007389 9.924110
bins <- quantile(boston_scaled$crim)
The new factor variable crime will hold the crime data, categorized by “low”, “med_low”, “med_high” and “high”.
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
Instead of the old crim variable, let’s now add the new crime variable to the boston_scaled dataframe and drop the old variable crim.
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
Divide the data set into training and test sets
To divide the dataset into training and test sets we first need the number of observations, which we get with nrow(). A common partition is 80% of the data for training and 20% for testing, which we will use here. With `sample()` we can randomly draw 80% of the row indices and store them in a variable. Then we can build our train and test sets from the chosen rows.
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
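Note that the split above is random, so the results change on every knit. A sketch of how one could make it reproducible by fixing the seed first (the seed value is an arbitrary assumption):

```r
library(MASS)
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))
set.seed(2020)                       # assumption: any fixed seed works
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)     # size 404.8 is truncated to 404
c(length(ind), n - length(ind))      # 404 train rows, 102 test rows
```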
A table of the crime categories in the train and test sets gives us a first view of the distribution of the different groups.
traincrime <- table(train$crime)
testcrime <- table(test$crime)
traincrime
##
## low med_low med_high high
## 95 106 99 104
testcrime
##
## low med_low med_high high
## 32 20 27 23
“Fit the linear discriminant analysis on the train set. Use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables. Draw the LDA (bi)plot.”
LDA Fit on the crime training set
Linear discriminant analysis (LDA) can be used as a classification method to find a linear combination of the variables that separates the target variable classes. It is fit with the function lda(), which takes as input a formula (e.g. target ~ x1 + x2 + ...) and the dataset. target ~ . means that all other variables in the dataset are used as predictors.
lda.fit <- lda(crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2351485 0.2623762 0.2450495 0.2574257
##
## Group means:
## zn indus chas nox rm age
## low 1.0811603 -0.9825242 -0.10655643 -0.9164956 0.49638073 -0.9388117
## med_low -0.1221778 -0.2852713 -0.01233188 -0.5831439 -0.11089597 -0.3821796
## med_high -0.3824294 0.1029616 0.20489520 0.3174882 0.05826981 0.3626050
## high -0.4872402 1.0170690 -0.08304540 1.0798617 -0.44482011 0.8183861
## dis rad tax ptratio black lstat
## low 0.9590141 -0.6881054 -0.7230348 -0.4710253 0.37471146 -0.820047636
## med_low 0.3724830 -0.5571551 -0.4966790 -0.0491823 0.32408079 -0.135668620
## med_high -0.3457382 -0.4041575 -0.3345772 -0.2482057 0.07861649 -0.008930089
## high -0.8426763 1.6386213 1.5144083 0.7813507 -0.80807049 0.886694051
## medv
## low 0.580712998
## med_low -0.005208233
## med_high 0.145909097
## high -0.733834522
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.11430535 0.728802130 -0.80474666
## indus 0.01088991 -0.366450602 0.50023685
## chas -0.08499137 -0.080930867 0.05371688
## nox 0.44644742 -0.532510403 -1.39644487
## rm -0.13330229 -0.003029865 -0.03554794
## age 0.23994415 -0.261955227 -0.35291294
## dis -0.08053323 -0.081411843 0.18130267
## rad 3.25984464 0.765444370 0.20069191
## tax -0.11922263 0.286355031 0.31372941
## ptratio 0.10783403 -0.022040620 -0.22969013
## black -0.14580229 0.020027875 0.15084136
## lstat 0.20089949 -0.204690253 0.48426660
## medv 0.14482769 -0.308316927 -0.20645912
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9482 0.0381 0.0137
The print of the fit shows the result of the LDA on the train set, with the output listing the three extracted discriminant functions LD1, LD2 and LD3; the highest proportion of trace is 0.95 for LD1. We get three discriminant functions because we have four groups (“low”, “med_low”, “med_high”, “high”), and the number of discriminants is always one less than the number of groups.
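The proportion-of-trace line can be recomputed from the singular values stored in the fit object. A self-contained sketch (fit here on the full scaled data rather than the random train set, so the numbers differ slightly from the printout above):

```r
library(MASS)
data("Boston")
bs <- as.data.frame(scale(Boston))
bs$crime <- cut(bs$crim, breaks = quantile(bs$crim), include.lowest = TRUE,
                labels = c("low", "med_low", "med_high", "high"))
bs$crim <- NULL                              # drop the continuous crime rate
fit <- lda(crime ~ ., data = bs)
round(fit$svd^2 / sum(fit$svd^2), 4)         # proportion of trace per discriminant
```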
LDA (bi)plot
For the LDA biplot a function has to be created to draw the LDA arrows for the different variables in the plot. The code for this function was taken from the DataCamp exercise, which refers to a Stack Overflow thread. Then the classes are stored in a numeric vector and plotted, together with the arrows, in an LDA (bi)plot.
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1, 2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[, choices[1]],
         y1 = myscale * heads[, choices[2]], col = color, length = arrow_heads)
  text(myscale * heads[, choices], labels = row.names(heads),
       cex = tex, col = color, pos = 3)
}
classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2,col=classes, pch=classes)
lda.arrows(lda.fit, myscale = 1)
From the plot we can see that the discriminant LD1 was hugely influenced by the variable rad. Looking at the definition of rad (“index of accessibility to radial highways”), accessibility to radial highways as a predictor seems to play a role in being placed into the high category.
“Save the crime categories from the test set and then remove the categorical crime variable from the test dataset. Then predict the classes with the LDA model on the test data. Cross tabulate the results with the crime categories from the test set. Comment on the results.”
Before going to the next step, let’s save the crime categories from the test set in correct_classes and then remove crime from the test dataset.
correct_classes <- test$crime
test <- dplyr::select(test, -crime)
Now the lda.fit model can be used to predict the crime classes of the test dataset created above:
lda.pred <- predict(lda.fit, newdata = test)
A cross-tabulation with the original crime values (stored in correct_classes) gives a nice overview of how well the prediction worked.
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 14 14 4 0
## med_low 4 10 6 0
## med_high 0 4 22 1
## high 0 0 0 23
Looking at the cross-tabulation one can see that for this data set the high category is predicted perfectly and med_high quite well, but for the lower categories low and med_low the model could be improved.
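The overall accuracy can be read off the cross-tabulation directly. A sketch with the numbers copied from the table above (a fresh run will differ because the train/test split is random):

```r
# confusion matrix from the printed cross-tabulation, rows = correct class
conf <- matrix(c(14, 14,  4,  0,
                  4, 10,  6,  0,
                  0,  4, 22,  1,
                  0,  0,  0, 23),
               nrow = 4, byrow = TRUE)
sum(diag(conf)) / sum(conf)   # overall accuracy, about 0.68
```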
Reload the Boston dataset and scale it to standardize the data before computing the distances.
# load MASS and Boston
library(MASS)
data('Boston')
boston_scaled2 <- scale(Boston)
boston_scaled2 <- as.data.frame(boston_scaled2) # save data in a dataframe
Distances
Without specifying the method argument, the dist() function calculates the Euclidean distance. The method can be changed, e.g. to method="manhattan", which will then give different results.
dist_eu <- dist(boston_scaled2)
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
K-Means clustering
Let’s do k-means clustering on the scaled dataset, starting with 4 as a first try for the number of cluster centers. The results can be inspected with `pairs()`.
km <- kmeans(boston_scaled2, centers = 4)
pairs(boston_scaled2, col = km$cluster)
As it is difficult to read anything in the full pairs plot, let’s divide it into four parts:
pairs(boston_scaled2[1:4], col = km$cluster)
pairs(boston_scaled2[5:8], col = km$cluster)
pairs(boston_scaled2[9:11], col = km$cluster)
pairs(boston_scaled2[12:14], col = km$cluster)
I will stick to the [5:8] part of the dataset and investigate how changing the number of clusters (2, 3, 5) alters the result compared to the initial choice of 4 clusters.
km <- kmeans(boston_scaled2, centers = 2)
pairs(boston_scaled2[5:8], col = km$cluster)
km <- kmeans(boston_scaled2, centers = 3)
pairs(boston_scaled2[5:8], col = km$cluster)
km <- kmeans(boston_scaled2, centers = 5)
pairs(boston_scaled2[5:8], col = km$cluster)
It seems to be difficult to decide on the best number of clusters just by eye. One can use the total within-cluster sum of squares (WCSS) to help with the decision: the optimal number of clusters is where the total WCSS drops sharply. Let’s investigate the behaviour of the total WCSS from 1 cluster to 10 clusters.
set.seed(123) # fix the seed for the initial cluster centers
k_max <- 10 # set the maximum number of clusters
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled2, k)$tot.withinss}) # total WCSS per k
qplot(x = 1:k_max, y = twcss, geom = 'line')
The total WCSS falls steeply up to about 2 clusters and more gently after that, with some fluctuation between 3 and 4 clusters. So we shouldn’t use more than 3 clusters, and 2 clusters is probably the optimal value.
Let’s rerun the k-means clustering with 2 centers on the whole scaled dataset.
km <- kmeans(boston_scaled2, centers = 2)
pairs(boston_scaled2, col = km$cluster)
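Beyond the pairs plot, the two-cluster solution can also be inspected numerically. A self-contained sketch (the seed is an arbitrary assumption, added only so the cluster labels are reproducible):

```r
library(MASS)
data("Boston")
bs <- as.data.frame(scale(Boston))
set.seed(123)                      # assumption: seed only for reproducibility
km <- kmeans(bs, centers = 2)
table(km$cluster)                  # cluster sizes
round(km$centers[, "crim"], 2)     # mean scaled crime rate per cluster
```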
(more chapters to be added similarly as we proceed with the course!)